The Annals of Applied Statistics 
2008, Vol. 2, No. 3, 1078-1102 
DOI: 10.1214/08-AOAS173 

© Institute of Mathematical Statistics, 2008 

ANALYSIS OF COMPARATIVE DATA WITH 
HIERARCHICAL AUTOCORRELATION 

By Cecile Ane 

University of Wisconsin — Madison 

The asymptotic behavior of estimates and information criteria in 
hnear models are studied in the context of hierarchically correlated 
sampling units. The work is motivated by biological data collected 
on species where autocorrelation is based on the species' genealog- 
ical tree. Hierarchical autocorrelation is also found in many other 
kinds of data, such as from microarray experiments or human lan- 
guages. Similar correlation also arises in ANOVA models with nested 
effects. I show that the best linear unbiased estimators are almost 
surely convergent but may not be consistent for some parameters 
such as the intercept and lineage effects, in the context of Brownian 
motion evolution on the genealogical tree. For the purpose of model 
selection I show that the usual BIG does not provide an appropriate 
approximation to the posterior probability of a model. To correct for 
this, an effective sample size is introduced for parameters that are in- 
consistently estimated. For biological studies, this work implies that 
tree-aware sampling design is desirable; adding more sampling units 
may not help ancestral reconstruction and only strong lineage effects 
may be detected with high power. 

1. Introduction. In many ecological or evolutionary studies, scientists 
collect "comparative" data across biological species. It has long been rec- 
ognized [Felsenstein (1985)] that sampling units cannot be considered inde- 
pendent in this setting. The reason is that closely related species are ex- 
pected to have similar characteristics, while a greater variability is expected 
among distantly related species. "Comparative methods" accounting for an- 
cestry relationships were first developed and published in evolutionary biol- 
ogy journals [Harvey and Pagel (1991)], and are now being used in various 
other fields. Indeed, hierarchical dependence structures of inherited traits 
arise in many areas, such as when sampling units are genes in a gene family 
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Fig. 1. Example of a genealogical tree from ^ units (left) and covariance matrix of vector 
Y under the Brownian motion model (right). 



[Gu (2004)], HIV virus samples [Bhattacharya et al. (2007)], human cul- 
tures [Mace and Holden (2005)] or languages [Pagel, Atkinson and Meade 
(2007)]. Such tree-structured units show strong correlation, in some way 
similar to the correlation encountered in spatial statistics. Under the spa- 
tial "infill" asymptotic where a region of space is filled in with densely 
sampled points, it is known that some parameters are not consistently es- 
timated [Zhang and Zimmerman (2005)]. It is shown here that inconsis- 
tency is also the fate of some parameters under hierarchical dependency. 
While spatial statistics is now a well recognized field, the statistical anal- 
ysis of tree-structured data has been mostly developed by biologists so 
far. This paper deals with a classical regression framework used to analyze 
data from hierarchically related sampling units [Martins and Hansen (1997), 
Housworth, Martins and Lynch (2004), Garland, Bennett and Rezende (2005), 
Rohlf (2006)]. 

Hierarchical autocorrelation. Although species or genes in a gene fam- 
ily do not form an independent sample, their dependence structure derives 
from their shared ancestry. The genealogical relationships among the units 
of interest are given by a tree (e.g.. Figure 1) whose branch lengths represent 
some measure of evolutionary time, most often chronological time. The root 
of the tree represents a common ancestor to all units considered in the sam- 
ple. Methods for inferring this tree typically use abundant molecular data 
and are now extensively developed [Felsenstein (2004), Semple and Steel 
(2003)]. In this paper the genealogical tree relating the sampled units is 
assumed to be known without error. 

The Brownian model (BM) of evolution states that characters evolve on 
the tree with a Brownian motion (Figure 2). After time t of evolution, the 
character is normally distributed, centered at the ancestral value at time 
and with variance proportional to t. Each internal node in the tree depicts 
a speciation event: an ancestral lineage splitting into two new lineages. The 
descendant lineages inherit the ancestral state just prior to speciation. Each 
lineage then evolves with an independent Brownian motion. The covariance 
matrix of the data at the n tips Y = (Yi, . . . , Yn) is then determined by the 



HIERARCHICAL AUTOCORRELATION IN COMPARATIVE DATA 3 



tree and its branch lengths: 



Y~AA(/z,a2Vtrcc), 



where /i is the character value at the root of the tree. Components of Vtree 
are the times of shared ancestry between tips, that is, Vij is the length 
shared by the paths from the root to the tips i and j (Figure 1). The same 
structural covariance matrix could actually be obtained under other models 
of evolution, such as Brownian motion with drift, evolution by Gaussian 
jumps at random times or stabilizing selection in a random environment 
[Hansen and Martins (1996)]. The i.i.d. model is obtained with a "star" 
tree, where all tips are directly connected to the root by edges of identical 
lengths. Another model of evolution assumes an Ornstein-Uhlenbeck (OU) 
process and accounts for stabilizing selection [Hansen (1997)]. The present 
paper covers the assumption of a BM structure of dependence, although 
several results also apply to OU and other models. As the Brownian motion 
is reversible, the tree can be re-rooted. When the root is moved to a new 
node in the tree, the ancestral state represents the state of the character 
at that new node, so re-rooting the tree corresponds to a re-parametrization. 

The linear model. A frequent goal is to detect relationships between two 
or more characters or to estimate ancestral traits [Schluter et al. (1997), 
Pagel (1999), Garland and Ives (2000), Huelsenbeck and Bollback (2001), 
Blomberg, Garland and Ives (2003), Pagel, Meade and Barker (2004)]. In 
this paper I consider the linear model Y = X/3 + e with e ~ AA(0, cr^ Vtrcc) 
as derived from a BM evolution on the tree. When the matrix of predictors 
X is of full rank k, it is well known that the best linear unbiased estimator 
for (3 is 



/3 = (x*v,-;Lx) x^v^-^Y. 
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Fig. 2. Simulation of BM evolution along the tree in Figure 1. Ancestral state was = 10. 
Observed values of Y are marked by points. 
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Random covariates are typically assumed to evolve with a BM on the same 
tree as Y. Fixed covariates are also frequently considered, such as deter- 
mined by a subgroup of tips. 

Although this model has already been used extensively, the present pa- 
per is the first one to address its asymptotic properties. For a meaningful 
asymptotic framework, it is assumed that the root of the tree is fixed while 
units are added to the sample. The reason is that the intercept relates to 
the ancestral state at the root of the tree. If the root is pushed back in time 
as tips are added to the tree, then the meaning of the intercept changes 
and there is no hope of consistency for the intercept. The assumption of a 
fixed root is just a rooting requirement. It does not prevent any unit to be 
sampled. 

Asymptotic results assume the sample size goes to infinity. I argue here 
that this is relevant in real biological studies. For instance, studies on phylo- 
genetically related viral samples have included hundreds of samples 
[Bhattacharya et al. (2007)] . Pagel, Atkinson and Meade (2007) have built 
and used a tree relating as many as 87 Indo-European languages. Many 
groups count an incredibly large number of species. For instance, there are 
about 20,000 orchid species to choose from [Dressier (1993)], over 10,000 
species of birds [J0nsson and Fjeldsa (2006)], or about 200 wild potato 
species [Spooner and Hijmans (2001)]. In addition, studies can consider sub- 
populations and even individuals within species, so long as they are related 
by a divergent tree. 

Organization. The main results are illustrated on real examples in Sec- 
tion 2. It is shown that /3 is convergent almost surely and in norm in 
Section 3. In Section 4 then, I show that some components of /3 are not con- 
sistent, converging to some random value. This is typically the case of the 
intercept and of lineage effect estimators, while estimates of random covari- 
ate effects are consistent. I investigate a sampling strategy — unrealistic for 
most biological settings — where consistency can be achieved for the inter- 
cept in Section 4. With this sampling strategy, I show a phase transition for 
the rate of convergence: if branches are not sampled close to the root of the 
tree fast enough, the rate of convergence is slower than the usual ^/n rate. 
In Section 5 I derive an appropriate formula for the Bayesian Information 
Criterion and introduce the concept of effective sample size. Applications to 
biological problems are discussed in Section 6, as well as applications to a 
broader context of hierarchical models such as ANOVA. 

2. Illustration of the main results. Davis et al. (2007) analyzed flower 
size diameter from n = 25 species. Based on the plants' tree (Figure 3 left) 
assuming a simple BM motion with no shift, calculations yield an effective 
sample size rig = 5.54 for the purpose of estimating flower diameter of the 
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ancestral species at the root. This is about a 4- fold decrease compared to the 
number of 25 species, resulting in a confidence interval over 2 times wider 
than otherwise expected from n = 25 i.i.d. sampling units. The analysis of a 
larger tree with 49 species [Garland et al. (1993)] shows an 8-fold decrease 
with rig = 6.11. Section 4 shows this is a general phenomenon: increasing the 
sample size n cannot push the effective sample size rif, associated with the 
estimation of ancestral states beyond some upper bound. More specifically, 
Section 4 shows that rig < kT/t, where k is the number of edges stemming 
from the root, t is the length of the shortest of these edges and T is the 
distance from the root to the tips (or its average value). To account for 
autocorrelation, Paradis and Claude (2002) introduced a degree of freedom 
dfp = L/T, where L is the sum of all branch lengths. Interestingly, He is 
necessarily smaller than df p when all tips of the tree are at equal distance 
T from the root (see Appendix A). 

Unexpectedly large confidence intervals are already part of biologists' ex- 
perience [Schluter et al. (1997)]. As Cunningham, Omland and Oakley (1998) 
put it, likelihood methods have "revealed a surprising amount of uncertainty 
in ancestral reconstructions" to the point that authors may be tempted to 
prefer methods that do not report confidence intervals [McArdle and Rodrigo 
(1994)] or to ignore autocorrelation due to shared ancestry [Martins (2000)]. 
Still, reconstructing ancestral states or detecting unusual shifts between two 
ancestors are very frequent goals. For example, Hansen (1997) hypothesized 
a shift in tooth size to have occurred along the ancient lineage separat- 
ing browsing horses and grazing horses. Recent micro-array data from gene 
families have inferred ancestral expression patterns, as well as shifts that 
possibly occurred after genes were duplicated [Gu (2004)] . Guo et al. (2007) 
have estimated shifts in brain growth along the human lineage and along 
the lineage ancestral to human/chimp. Sections 3 and 4 show that under the 
BM model ancestral reconstructions and shift estimates are not consistent, 
but are instead convergent toward a random limit. This is illustrated by 
small effective sample sizes associated with shift estimators. Among the 25 
plant species sampled by Davis et al. (2007), 3 parasitic Rafftesiaceae species 
have gigantic flowers (in bold in Figure 3). Under a BM model with a shift 
on the Rafftesiaceae lineage, the effective sample sizes for the root's state 
(rig = 3.98) and for the shift (rig = 2.72) are obtained from the Rafftesiaceae 
subtree and the remaining subtree. These low effective sample sizes suggest 
that only large shifts can be detected with high power. 

The potential lack of power calls for optimal sampling designs. Trees are 
typically built from abundant and relatively cheap molecular sequence data. 
More and more often, a tree comprising many tips is available, while traits 
of interest cannot be collected from all tips on the tree. A choice has to 
be made on which tips should be kept for further data collection. Until 
recently, investigators did not have the tree at hand to make this choice, 
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Fig. 3. Phylogenetic trees from Davis et al. (2007) with 25 plant species, rie = 5.54 (left) 
and from Garland et al. (1993) with 49 mammal species, — 6.11 (right). Bottom: effec- 
tive sample size for sub-samples of a given size. Vertical bars indicate 95% confidence 
interval and median Ue values when tips are selected at random from the plant tree (left) 
and mammal tree (right). Dots indicate optimal rie values. 



but now most investigators do. Therefore, optimal sampling design should 
use information from the tree. Figure 3 shows the effective sample size Ue 
associated with the root's state in the simple BM model. First, sub-samples 
were formed by randomly selecting tips and He was calculated for each sub- 
sample. Since there can be a huge number of combinations of tips, 1000 
random sub-samples of size k were generated for each k. Median and 95% 
confidence intervals for Ue values are indicated by vertical bars in Figure 
3. Second, the sub-samples of a size k that maximize the effective sample 
size Tig were obtained using step-wise backward and forward searches. Both 
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search strategies agreed on the same maximal rie values, which are indicated 
with dots in Figure 3. From both trees, only 15 tips suffice to obtain a 
near maximum effective sample size, provided that the selected tips are 
well chosen, not randomly. The proposed selection of tips maximizes Ue 
and is based on the phylogeny only, prior to data collection. In view of the 
bound for rig mentioned above, the selected tips will tend to retain the k 
edges stemming from the root and to minimize the length of these edges by 
retaining as many of the early branching lineages as possible. 

For the purpose of model selection, BIC is widely used [Schwarz (1978), 
Kass and Raftery (1995), Butler and King (2004)] and is usually defined as 
—2\nL{P,a) +plog(n), where L{P,(t) is the maximized likelihood of the 
model, p the number of parameters and n the number of observations. Each 
parameter in the model is thus penalized by a log(n) term. Section 6 shows 
that this formula does not provide an approximation to the model posterior 
probability. Instead, the penalty associated with the intercept and with a 
shift should be bounded, and log(l + ne) is an appropriate penalty to be used 
for each inconsistent parameter. On the plant tree, the intercept (ancestral 
value) should therefore be penalized by log(l + 5.54) in the simple BM model. 
In the BM model that includes a shift along the parasitic plant lineage, the 
intercept should be penalized by ln(l + 3.98) and the shift by ln(l + 2.72). 
These penalties are AlC-like (bounded) for high-variance parameters. 

3. Convergence of estimators. This section proves the convergence of 
P = Z^^"-* as the sample size n increases. The assumption of a fixed root 
implies that the covariance matrix Vtrco = V„ (indexed by the sample size) 
is a submatrix of V„+i. 

Theorem 1. Consider the linear model Yi = Xj/3-|-ej with 

e(") = (£!,...,£„)* ~AA(0,ct2v„) 

and where predictors X may be either fixed or random. Assume the design 
matrix X^"^ (with Xj for ith row) is of full rank provided n is large enough. 

Then the estimator /3„ = (xWV„"1x("))~^X(")V„"^Y(") is convergent 
almost surely and in . Component Pn.j converges to the true value Pj 
if and only if its asymptotic variance is zero. Otherwise, it converges to a 
random variable which depends on the tree and the actual data. 

Note that no assumption is made on the covariance structure V„, except 
that it is a submatrix of V„+i. Therefore, Theorem 1 holds regardless of how 
the sequence V„ is selected. For instance, it holds for the OU model, whose 
covariance matrix has components Vij = e~""^'-i or Vij = (1 — ^-^'^Uj^j^-adij 
(depending whether the ancestral state is conditioned upon or integrated 
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out), where dij is the tree distance between tips i and j, and a is the known 
selection strength. 

Theorem 1 can be viewed as a strong law of large numbers: in the absence 
of covariates and in the i.i.d. case /3n is just the sample mean. Here, in 
the absence of covariates /3„ is a weighted average of the observed values, 
estimating the ancestral state at the root of the tree. Sampling units close to 
the root could be provided by fossil species or by early viral samples when 
sampling spans several years. Such units, close to the root, weigh more in 
Pn than units further away from the root. Theorem 1 gives a law of large 
number for this weighted average. However, we will see in Section 4 that the 
limit is random: /3„ is inconsistent. 

Proof of Theorem 1. The process e = (ei,e2, • • •) is well defined on 
a probability space because the covariance matrix V„ is a submatrix of 
V„+i. Derivations below are made conditional on the predictors X. In a 
Bayesian-like approach, the probability space is expanded to O = M'^ x by 
considering /3 G M'^ as a random variable, independent of errors e. Assume 
a priori that /? is normally distributed with mean and covariance matrix 
(T^Ife, Ifc being the identity matrix of size k. Let J-n be the filtration gener- 
ated by Yi, . . . ,Yn- Since l3,Yi,Y2, ■ ■ ■ is a Gaussian process, the conditional 
expectation E(/?|.7>^) is a linear combination of Yi, . . . ,y„ up to a constant: 

E(/3|.F„)=a„ + M„Y("). 

The almost sure converge of Pn will follow from the almost sure convergence 
of the martingale M{P\Tn) and from identifying M„Y(") with a linear trans- 
formation of Pn- The vector a„ and matrix M„ are such that E(/?|.7>i) is the 
projection of P on in L^(rj), that is, these coefficients are such that 

trace(E(/3 - a„ - M„y("))(/3 - a„ - M^Y^"))*) 

is minimum. Since Yi = Xj/3 + e^, /3 is centered and independent of e, we get 
that a„ = and the quantity to be minimized is 

tr((Ifc - M„XW) var(/3)(Ifc - M„XW)*) + tr(M„ var(e("))M^). 

The matrix M„ appears in the first term through M„X'^"-' , so we can mini- 
mize (T^tr(M„V„M^) under the constraint that B = MnX*^'") is fixed. Using 
Lagrange multipliers, we get M„V„ = AX^")* subject to M„X(") = B. As- 
suming XW*V-iX(") is invertible, it follows A = B(X(")*V-iX("))-i and 
M„Y(") = B/3("). The minimum attained is then tr^ tr(B(X(")*V-iX("))-iB*). 
This is necessarily smaller than o"^ tr(MV„M*) when M is formed by M„_i 
and an additional column of zeros. So for any B, the trace of B(X^"'^*V~^X('"^)~"^B* 
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is a decreasing sequence. Since it is also nonnegative, it is convergent and 
so IS (XW V;^iX("))"^ Now the quadratic expression 

tr((Ifc - B)(Ifc - B)*) + tr(B(x(")*V-ix("))~'B*) 

is minimized if B satisfies B(Ifc + (X(")*V~'^X(")) ^) = I^.. Note tlie symmet- 
ric definite positive matrix + (X(")*V~'^X("))~^ was shown above to be 
decreasing with n. In summary, E(/5|J^„) = (1^ + (X(")*V^ix("))-^)-^/3W. 
This martingale is bounded in L^(rj) so it converges almost surely and in 
L'^in) toE(/3|^oo). Finally, /?(")-/?= (1^ + (X(")*V-^ixW)"')E(/3|^„) - /3 
is also convergent almost surely and in iF'iQ.). But /3„ — /? is a function of 
iv in the original probability space independent of jS. Therefore, for any 
/?, /^^""^ converges almost surely and in L'^{Q). Since e is a Gaussian process, 
the limit of /J^"^ is normally distributed with covariance matrix the limit of 
(XW*V-1X("))~\ It follows that (5^^\ which is unbiased, converges to the 

true f3k if and only if the /cth diagonal element of (X(")*V~^X("')) ^ goes to 
0. □ 

4. Consistency of estimators. In this section I prove bounds on the vari- 
ance of various effects (3i. From Theorem 1 we know that /3i is strongly 
consistent if and only if its variance goes to zero. 

4.1. Intercept. Assume here that the first column of X is the column 1 
of ones, and the first component of (3, the intercept, is denoted /3o- 

Proposition 2. Let k be the number of daughters of the root node, 
and let t be the length of the shortest branch stemming from the root. Then 
var(/3o) > a'^t/k. In particular, when the tree is binary we have var(/3o) > 
aH/2. 

The following inconsistency result follows directly. 

Corollary 3. // there is a lower bound t > on the length of branches 
stemming from the root, and an upper bound on the number of branches 
stemming from the root, then (3q is not a consistent estimator of the intercept, 
even though it is unbiased and convergent. 

The conditions above are very natural in most biological settings, since 
most ancient lineages have gone extinct. The lower bound may be pushed 
down if abundant fossil data is available or if there has been adaptive radi- 
ation with a burst of speciation events at the root of the tree. 
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Proof of Proposition 2. Assuming the linear model is correct, the 
variance of /3 is given by var(/3) = a"^(X*V~-'^X)~^, where the first column 
of X is the vector 1 of ones, so that the variance of the intercept estimator 
is just the first diagonal element of (t2(X*V~1X)~^ But (X^V'^X)^^^ > 
(Xj*V~'^Xj)~^ for any index i [Rao (1973), 5a.3, page 327], so the proof can 
be reduced to the simplest case with no covariates: Yi = (3q + Ei. The basic 
idea is that the information provided by all the tips on the ancestral state /?o 
is no more than the information provided just by the k direct descendants 
of the root. Let us consider Zi,...,Zk to be the character states at the k 
branches stemming from the root after a time t of evolution (Figure 4, left). 

These states are not observed, but the observed values Yi,...,Yn have 
evolved from Zi, . . . , Zk- Now I claim that the variance of f3o obtained from 
the Y values is no smaller than the variance of /3q^^ obtained from the Z 
values. Since the Z values are i.i.d. Gaussian with mean Pq and variance 
a'^t, Pq^^ has variance a'^t/k. To prove the claim, consider Pq ~AA(0, cr^) 
independent of e. Then K{I3q\Yi, . . . ,Yn, Zi, . . . , Z^) = E,{Pq\Zi, . . . , Zk) so 
that var(E(/?o|yi, . . . , l^n)) < var(E(/3o|-Z'i, . . . , Z^)). The proof of Theorem 
1 shows that K{Po\Yi, . . . ,Yn) = (3o/ {I + ty) where = (1*V~^1)~^ so, sim- 
ilarly, K{Po\Zi, . . . , Zfc) = pjf^ /{I + tz) where = t/k. Since /3o and $0 — (3o 
are independent, the variance of K{Pq\Yi, . . . ,Yn) is (cr^ + ty(T^)/(l + t^)^ = 
o"^/(l + ty). The variance of E(/?o|.Z'i, . . . , Z^) is obtained similarly and we get 
l/{l + ty) < l/{l + tz), that is, ty > tz and var(/3o) = cr^(l*V-il)"^ > aH/k. 
□ 

4.2. Lineage effect. This section considers a predictor Xi that defines 
a subtree, that is, Xu = 1 if tip i belongs to the subtree and otherwise. 
This is similar to a 2-sample comparison problem. The typical "treatment" 
effect corresponds here to a "lineage" effect, the lineage being the branch 
subtending the subtree of interest. If a shift occurred along that lineage. 




Fig. 4. Left: Observed states are Yi, . . . ,Yn, while Zi, . . . , Zk are the unobserved states 
along the k edges branching from the root, after time t of evolution. Y provides less 
information on f3o than Z. Right: Zo, Zi, . . . , Zkt^p o.^^ unobserved states providing more 
information on the lineage effect [3i than the observed Y values. 
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Fig. 5. Model Mo (left) and M\ (right) with a lineage effect, Xi is the indicator of a 
subtree. Model Mi conditions on the state at the subtree's root, modifying the dependence 
structure. 

tips in the subtree will tend to have, say, high trait values relative to the 
other tips. However, the BM model does predict a change, on any branch 
in the tree. So the question is whether the actual shift on the lineage of 
interest is compatible with a BM change, or whether it is too large to be 
solely explained by Brownian motion. Alternatively, one might just estimate 
the actual change. 

This consideration leads to two models. In the first model, a shift /3i = 
(S) 

/3^Qp is added to the Brownian motion change along the branch of interest, 

(s) 

so that /J^op represents the character displacement not due to BM noise. 

( S 

In the second model, /?i = /^^op is the actual change, which is the sum 
of the Brownian motion noise and any extra shift. Observations are then 
conditioned on the actual ancestral states at the root and the subtree's root 
(Figure 5). By the Markov property, observations from the two subtrees 
are conditionally independent of each other. In the second model then, the 
covariance matrix is modified. The models are written 

Y = l/?o + Xi/3i + --- + Xfc/3fc + e 

with /?! = /?^Qp and e ~ AA(0, ci^Vtrcc) in the first model, while /3i = P^^^ 
and £ ~ A^(0, o"^ diag(Vtop, Vbot)) in the second model, where Vtop and Vbot 
are the covariance matrices associated with the top and bottom subtrees 
obtained by removing the branch subtending the group of interest (Figure 
5). 

Proposition 4. Let fctop be the number of branches stemming from the 
subtree of interest, ttop the length of the shortest branch stemming from the 
root of this subtree, and ti the length of the branch subtending the subtree. 
Then 

var(/3t^fj]) > (7^{ti + ttop/hop) and vaT0l^f^) > cr^ttop/^top- 
Therefore, if ttop/^top remains bounded from below when the sample size 

increases, both estimators /3^^p (pure shift) and P^^p^ (actual change) are 
inconsistent. 
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From a practical point of view, unless fossil data is available or there was 
a radiation (burst of speciation events) at both ends of the lineage, shift 
estimators are not consistent. Increasing the sample size might not help 

detect a shift as much as one would typically expect. 

(s) 

Note that the pure shift P^^^ is confounded with the Brownian noise, so 
it is no wonder that this quantity is not identifiable as soon as ti > 0. The 
advantage of the first model is that the BM with no additional shift is nested 
within it. 

Proof of Proposition 4. In both models var(/3i) is the second diago- 
nal element of (T^(X*V~^X) ^ which is bounded below by (T^(Xi*V~^Xi) ^, 
so that we need just prove the result in the simplest model Y = /3iXi + e. 
Similarly to Proposition 2, define Zi,. . . , Z^^^^ as the character states at the 
/ctop direct descendants of the subtree's root after a time ttop of evolution. 
Also, let Zq be the state of node just parent to the subtree's root (see Fig- 
ure 4, right). Like in Proposition 2, it is easy to see that the variance of 
/?! given the Y is larger than the variance of f3i given the Zq, Zi, . . . , Zj.^^^. 

In the second model, Pi = Pl^^ is the actual state at the subtree's root, 

so Zi , . . . , Zfc^^p are i.i.d. Gaussian centered at p[op^ with variance cr^ttop 

and the result follows easily. In the first model, the state at the subtree's 

(S) "(S) 
root is the sum of Zq, Pl^^ and the BM noise along the lineage, so Pl^^ = 

(s) 

{Z\ -1- h Zfcj^p ) //ctop — Zq. This estimate is the sum of /Stop; the BM noise 

and the sampling error about the subtree's root. The result follows because 
the BM noise and sampling error are independent with variance a^tx and 
<7^ttop/A:top respectively. □ 

4.3. Variance component. In contrast to the intercept and lineage effects, 
inference on the rate of variance accumulation is straightforward. An 
unbiased estimate of o"^ is 

= RSS/(n - k) = (Y - Y)V-L(Y - Y)/(n - k), 

where Y = X/3 are predicted values and n is the number of tips. The classical 
independence of and P still holds for any tree, and (n — k)a'^/a'^ follows 
a Xn-k distribution, k being the rank of X. In particular, o"^ is unbiased 
and converges to almost surely as the sample size increases, as shown 
in Appendix B. Although not surprising, this behavior contrasts with the 
inconsistency of the intercept and lineage effect estimators. We keep in mind, 
however, that the convergence of may not be robust to a violation of the 
normality assumption or to a misspecification of the dependence structure, 
either from a inadequate model (BM versus OU) or from an error in the 
tree. 
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4.4. Random covariate effects. In this section X denotes the matrix of 
random covariates, excluding the vector of ones or any subtree indicator. In 
most cases it is reasonable to assume that random covariates also follow a 
Brownian motion on the tree. Covariates may be correlated, accumulating 
covariance fS on any single edge of length t. Then covariates j and k have 
covariance Sj^Vtrcc- With a slight abuse of notation (considering X as a 
single large vector), var(X) = S (8) Vtree- 

Proposition 5. Consider Y = l/3o + X/3i + e with e ~Ar(0,c7^ Vtree)- 
Assume X follows a Brownian evolution on the tree with nondegenerate 
covariance I].- X ~ J\f{fix, S ig) Vtree)- Then var(/3i) ~ cr^S~^/n asymptot- 
ically. In particular, Pi estimates Pi consistently by Theorem 1. Random 
covariate effects are consistently and efficiently estimated, even though the 
intercept is not. 

Proof. We may write V~^ = R*R using the Cholesky decomposition 
for example. Since Rl 7^ 0, we may find an orthogonal matrix O such that 
ORl = (a,0,...,0)* for some a, so without loss of generality, we may assume 
that Rl = (a, 0, ... , 0)*. The model now becomes RY = Rl/3o + RX/3i + 
Re, where errors Re are now i.i.d. Let Xq be the first row of RX and 
let Xi be the matrix made of all but the first row of RX. Similarly, let 
{yo,YiY = RY and {eQ,£iY = Re. The model becomes Yi = Xi/3i + ei, 

2/0 = a/5o + Xo/3i + eo with least square solution Pi = (X^Xi) X|Yi = 
Pi + (X^Xi) X*ei and Pq = {yo — Xo/3i)/a. The variance of Pi conditional 
on X is then it^(X^Xi)~^. Using the condition on Rl, the rows of Xi are 
i.i.d. centered Gaussian with variance-covariance XI and (X^Xi)~^ has an in- 
verse Wishart distribution with n — 1 degrees of freedom [Johnson and Kotz 
(1972)]. The unconditional variance of var(/3i) is then a^E(X.{Xiy^ = 
(7^5]"^/(n — k — 2), where k is the number of random covariates, which 
completes the proof. □ 

Remark. The result still holds if one or more lineage effects are in- 
cluded and if the model conditions upon the character state at each subtree 
(second model in Section 4.2). The reason is that data from each subtree are 
independent, and in each subtree the model has just an intercept in addition 
to the random covariates. 

The behavior of random effect estimators contrasts with the behavior of 
the intercept or lineage effect estimators. An intuitive explanation might be 
the following. Each cherry in the tree (pair of adjacent tips) is a pair of 
siblings. Each pair provides independent evidence on the change of Y and 
of X between the 2 siblings, even though parental information is unavail- 
able. Even though means of X and Y are poorly known, there is abundant 
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evidence on how they change with each other. Similarly, the method of inde- 
pendent contrasts [Felsenstein (1985)] identifies n — 1 i.i.d. pair-like changes. 

5. Phase transition for symmetric trees. The motivation for this section 
is to determine the behavior of the intercept estimator when branches can 
be sampled closer and closer to the root. I show that the intercept can be 
consistently estimated, although the rate of convergence can be much slower 
than root n. The focus is on a special case with symmetric sampling (Figure 
6). The tree has m levels of internal nodes with the root at level 1. All 
nodes at level i share the same distance from the root ti + • — h ti-i and the 
same number of descendants dj. In a binary tree all internal nodes have 2 
descendants and the sample size is n = 2™. The total tree height is set to 1, 
that is, ti + • ■ • + tm = 1. 

With these symmetries, the eigenvalues of the covariance matrix V„ 
can be completely determined (see Appendix C), smaller eigenvalues be- 
ing associated with shallower internal nodes (close to the tips) and larger 
eigenvalues being associated with more basal nodes (close to the root). 
In particular, the constant vector 1 is an eigenvector and (1*V„1) = 
ti/di H \-tm/{di . ..dm). 

In order to sample branches close to the root, consider replicating the 
major branches stemming from the root. Specifically, a proportion q of each 
of these di branches is kept as is by the root, and the other proportion 1 — q 
is replicated along with its subtending tree (Figure 6), that is, = q'^~^ 
and t["^^ = (1 — q)q^~''' for « = 2, . . . ,m. For simplicity, assume further that 
groups are replicated d>2 times at each step, that is, di = ■ ■ ■ = dm = d. 




Fig. 6. Symmetric sampling (left) and replication of major branches close to the root 
(right). 
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The result below shows a law of large numbers and provides the rate of 
convergence. 

Proposition 6. Consider the model with an intercept and random co- 
variates Y = 1(3q + X/?i + e with e ~ AA(0, cr^V„) on the symmetric tree 
described above. Then (3q is consistent. The rate of convergence experiences 
a phase transition depending on how close to the root new branches are 
added: var(/3o) is asymptotically proportional to if Q < ^/d, ln(n)n~'^ if 
q= l/d or n°' if q > 1/d where a = ln(g)/ln(d). Therefore, the root-n rate of 
convergence is obtained as in the i.i.d. case if q < l/d. Convergence is much 
slower if q> l/d. 

Proof. By Theorem 1, the consistency of Pq follows from its variance 
going to 0. First consider the model with no covariates. Up to <t^, the vari- 
ance of is {l''^\'nl-)~^ = ti/di + ■ ■ ■ + tm/{di . . .dm), which is q"^~^/d + 
(1 - q){l - {qd)"'-^)/{d"'{l - qd)) if / 1 and (1 + (1 - q){m - 
if qd= 1. The result follows easily since n = ti™, m oc ln(n) and = n". 
In the presence of random covariates, it is easy to see that the variance of 
/3o is increased by var(/ix(/3i — /3i)), where jlx = Xi/a is the row vector of 
the covariates' estimated ancestral states (using notations from the proof of 
Proposition 5). By Proposition 5 this increase is 0{n~^), which completes 
the proof. □ 

6. Bayesian information criterion. The basis for using BIC in model 
selection is that it provides a good approximation to the marginal model 
probability given the data and given a prior distribution on the parameters 
when the sample size is large. The proof of this property uses the i.i.d. 
assumption quite heavily, and is based on the likelihood being more and 
more peaked around its maximum value. Here, however, the likelihood does 
not concentrate around its maximum value since even an infinite sample size 
may contain little information about some parameters in the model. The 
following proposition shows that the penalty associated with the intercept 
or with a lineage effect ought to be bounded, thus smaller than log(n). 

Proposition 7. Consider k random covariates X with Brownian evo- 
lution on the tree and nonsingular covariance S, and the linear models 

Y = /?ol + X/3i+e with er^N{<d, a'^W tree) (Mq) 

Y = /3ol + X/3i + /?topltop + e with er^M{d,a'^Y tree), (Mi) 

where the lineage factor Itop is the indicator of a (top) subtree. Assume a 
smooth prior distribution ir over the parameters = (/?, a) and a sampling 
such that 1*V~^1 is bounded, that is, branches are not sampled too close 
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from the root. With model Mi assume further that branches are not sampled 
too close from the lineage of interest, that is, l*QpV~^ltop is bounded. Then 
for both models, the marginal probability of the data ¥(Y) = j¥{Y\9)Tr{9) d9 
satisfies 

-21ogP(y) = -2\nL{e) + {k + l)ln(n) + 0(1) 

as the sample size increases. Therefore, the penalty for the intercept and for 
a lineage effect is bounded as the sample size increases. 

The poorly estimated parameters are not penalized as severely as the 
consistently estimated parameters, since they lead to only small or moderate 
increases in likelihood. Also, the prior information continues to influence the 
posterior of the data even with a very large sample size. Note that the lineage 
effect /3top may either be the pure shift or the actual change. Model Mq is 
nested within Mi in the first case only. 

In the proof of Proposition 7 (see Appendix D) the 0(1) term is shown 
to be dominated by 

C = log det S - (A; + 1) log(27ra2) + log 2 + D, 

where D depends on the model. In Mq 

(1) D = -2logf exp{-{(3o-pof/{2toa^))7riPo,Pi,a)dPo, 
where to = lim(l*V^4)"^ In Mi 

(2) D = -2logf exp(-^*W-i/3/(2a2))7r(/3o,Aop,^i,^)d/3od/?top, 

where /3* = (/3o — /3o) Aop — Aop) and the 2x2 symmetric matrix has 
diagonal elements liml*V~^l = t^^, lim l*QpV~^ltop < oo and off-diagonal 
element lim l*V~^ltop, which does exist. 

In the rest of the section I assume that all tips are at the same distance 
T from the root. This condition is realized when branch lengths are chrono- 
logical times and tips are sampled simultaneously. Under BM, Yi,...,Yn 
have common variance a^T. The ancestral state at the root is estimated 
with asymptotic variance cj^/lim„ 1*V~^1, while the same precision would 
be obtained with a sample of rig independent variables where 

ne = riim 1*V~4. 

Therefore, I call this quantity the effective sample size associated with the 
intercept. 

The next proposition provides more accuracy for the penalty term in case 
the prior has a specific, reasonable form. In some settings, it has been shown 
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that the error term in the BIC approximation is actually better than 0(1). 
Kass and Wasserman (1995) show this error term is only 0{n~^^'^) if the 
prior carries the same amount of information as a single observation would, 
as well as in the context of comparing nested models with an alternative hy- 
pothesis close to the null. I follow Kass and Wasserman (1996) and consider 
a "reference prior" that contains little information, like a single observation 
would [see also Raftery (1995, 1996), Wasserman (2000)]. In an empirical 
Bayes way, assume the prior is Gaussian centered at 9. Let (/3i,cr) have 
prior variance = diag((T^l]~^, a^/2) and be independent of the other 
parameter (s) /3o and Ptop- Also, let /3q have variance a'^T in model Mq. 

In model Mi, assume further that the tree is rooted at the base of the 
lineage of interest, so that the intercept is the ancestral state at the base 
of that lineage. This reparametrization has the advantage that Pq and Pq + 
Ptop are uncorrelated asymptotically. A single observation from outside the 
subtree of interest (i.e., from the bottom subtree) would be centered at Po 
with variance cr'^T, while a single observation from the top subtree would 
be centered at Po + /3top with variance cj^Ttop. In case /3top is the pure shift, 
then Ttop = T. If /?top is the actual change along the lineage, then Ttop is 
the height of the subtree excluding its subtending branch. Therefore, it is 
reasonable to assign (/3o) Aop) a prior variance of (T^W,r with 




-T T + Ttopj' 



The only tips informing Pq + /?top are those in the top subtree and the only 
units informing Pq are those in the bottom subtree. Therefore, the effective 
sample sizes associated with the intercept and lineage effects are defined as 

ne,hot = niml*Vj;^\l, rLe,top = Ttop liml*Vj";pl, 

where Vbot and Vtop are the variance matrices from the bottom and top 
subtrees. 



Proposition 8. Consider models Mq and Mi and the prior specified 
above. T/ien P(y|Mq) = -2 In L(^|Mo) + (/c + 1) ln(n) + ln(l + n^) + o(l) and 
¥{Y\Mi) = -2 InL(^lMi) + {k + 1) ln(n) +ln(l + 7ie,bot) + ln(l + n,,top) + o(l) . 
Therefore, a reasonable penalty for the nonconsistently estimated parameters 
is the log of their effective sample sizes plus one. 



Proof. With model Mq, we get from (1) 

n 91 M 91 /■ f (/^o - /^o)' {Po-Pof\dPQ 
D = -21og.(A,a) - 21ogy exp(^-^^-^^ ^_ 

= -21og^(/3i,a) + log(l + r/io). 



18 



C. ANE 



Now — 21og7r(/3i,(T) = (A; + l)log(27r) — logdet J„ cancels with the first con- 
stant terms to give C = log(l + T /to) = log(l + ng). With model Mi, we 
get 



-1/2 



detwy^ 

so that again C = log(det(W~^ + W~^) det W^r). It remains to identify this 
quantity with ln(l + ?ie,bot) + + ^e^top)- It is easy to see that det W,r = 
TTtop and 

TT I rp~l rp—l 

V top top 

Since V is block diagonal diag(Vtop, Vbot)) we have that l*V~-'^ltop = n.e,top/?top 
and l*V-4 = 1*V41 + l*Vb„\l = ne,top/Ttop + ne,bot/r. Therefore, W"! 
has diagonal terms rig top/Ttop + T^e,hot/T and ng^bot/^top and off-diagonal 
term ne,bot/7top. We get det(W"^ W;;^) = (ne,bot + l)/Jtop(ne,bot + 1)/^, 
which completes the proof. □ 

Akaike's information criterion (AIC). This criterion [Akaike (1974)] is 
also widely used for model selection. With i.i.d. samples, AIC is an estimate 
of the Kullback-Leibler divergence between the true distribution of the data 
and the estimated distribution, up to a constant [Burnham and Anderson 
(2002)]. Because of the BM assumption, the Kullback-Leibler divergence 
can be calculated explicitly. Using the Gaussian distribution of the data, the 
mutual independence of o"^ and /3 and the chi-square distribution of <t^ , the 
usual derivation of AIC applies. Contrary to BIC, the AIC approximation 
still holds with tree-structure dependence. 

7. Applications and discussion. This paper provides a law of large num- 
bers for non i.i.d. sequences, whose dependence is governed by a tree struc- 
ture. Almost sure convergence is obtained, but the limit may or may not 
be the expected value. With spatial or temporal data, the correlation de- 
creases rapidly with spatial distance or with time typically (e.g., AR pro- 
cesses) under expanding asymptotics. With a tree structure, the dependence 
of any 2 new observations from 2 given subtrees will have the same corre- 
lation with each other as with "older" observations. In spatial statistics, 
infill asymptotics also harbor a strong, nonvanishing correlation. This de- 
pendence implies a bounded effective sample size in most realistic bio- 
logical settings. However, I showed that this effective sample size pertains 
to locations parameters only (intercept, lineage effects). Inconsistency has 
also been described in population genetics. In particular, Tajima's estima- 
tor of the level of sequence diversity from a sample of re individuals is not 
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consistent [Tajima (1983)], while asymptotically optimal estimators only 
converge at rate log(n) rather than n [Fu and Li (1993)]. The reason is that 
the genealogical correlation among individuals in the population decreases 
the available information. 

Sampling design. Very large genealogies are now available, with hun- 
dreds or thousands of tips [Cardillo et al. (2005), Beck et al. (2006)]. It is 
not uncommon that physiological, morphological or other phenotypic data 
cannot be measured for all units in the group of interest. For the purpose 
of estimating an ancestral state, the sampling strategy suggested here max- 
imizes the scaled effective sample size 1*V~^1 over all subsamples of size n, 
where n is an affordable number of units to subsample. This criterion is a 
function of the rooted tree topology and its branch lengths. It is very easy to 
calculate with one tree traversal using Felsenstein's algorithm [Felsenstein 
(1985)], without inverting V„. It might be computationally too costly to as- 
sess all subsamples of size n, but one might heuristically search only among 
the most star-like subtrees. Backward and forward stepwise search strategies 
were implemented, either removing or adding tips one at a time. 

Desperate situation? This paper provides a theoretical explanation for 
the known difficulty of estimating ancestral states. In terms of detecting non- 
Brownian shifts, our results imply that the maximum power cannot reach 
100%, even with infinite sampling. Instead, what mostly drives the power of 
shift detection is the effect size: (3i/y/ia where /3i is the shift size and t is the 
length of the lineage experiencing the shift. The situation is desperate only 
in cases when the effect size is small. Increased sampling may not provide 
more power. 

Beyond the Brownian motion model. The convergence result applies to 
any dependence matrix. Bounds on the variance of estimates do not apply 
to the Ornstein-Uhlenbeck model, so it would be interesting to study the 
consistency of estimates in this model. Indeed, when selection is strong the 
OU process is attracted to the optimal value and "forgets" the initial value 
exponentially fast. Several studies have clearly indicated that some ancestral 
states and lineage-specific optimal values are not estimable [Butler and King 
(2004), Verdii and Gleiser (2006)], thus bearing on the question of how effi- 
ciently these parameters can be estimated. While the OU model is already 
being used, theoretical questions remain open. 

Broader hierarchical autocorrelation context. So far linear models were 
considered in the context of biological data with shared ancestry. However, 
implications of this work are far reaching and may affect common practices 
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Fig. 7. Trees associated with ANOVA models: 3 groups with fixed effects (left) or random 
effects (right). Variance within and among groups are a\ and respectively. 

in many fields, because tree structured autocorrelation underlies many ex- 
perimental designs. For instance, the typical ANOVA can be represented 
by a forest (with BM evolution), one star tree for each group (Figure 7). If 
groups have random effects, then a single tree captures this model (Figure 
7). It shows visually how the variation decomposes into within and among 
group variation. ANOVA with several nested effects would be represented 
by a tree with more hierarchical levels, each node in the tree representing 
a group. In such random (or mixed) effect models, asymptotic results are 
known when the number of groups becomes large, while the number of units 
per group is not necessarily required to grow [Akritas and Arnold (2000), 
Wang and Akritas (2004), Giiven (2006)]. The results presented here pertain 
to any kind of tree growth, even when group sizes are bounded. 

Model selection. Many aspects of the model can be selected for, such 
as the most important predictors or the appropriate dependence structure. 
Moreover, there often is some uncertainty in the tree structure or in the 
model of evolution. Several trees might be obtained from molecular data on 
several genes, for instance. These trees might have different topologies or 
just different sets of branch lengths. BIC values from several trees can be 
combined for model averaging. I showed in this paper that the standard form 
of BIC is inappropriate. Instead, I propose to adjust the penalty associated 
to an estimate with its effective sample size. AIC was shown to be still 
appropriate for approximating the Kullback-Leibler criterion. 

Open questions. It was shown that the scaled effective sample size is 
bounded as long as the number k of edges stemming from the root is bounded 
and their lengths are above some t > 0. The converse is not true in general. 
Take a star tree with edges of length v?. Then Yn ~ AA(/i, a^n^) are inde- 
pendent, and 1*V~^1 = X)^"^ is bounded. However, if one requires that the 
tree height is bounded (i.e., tips are distant from the root by no more than 
a maximum amount), then is it necessary to have k < oo and t > for the 
effective sample size to be bounded? If not, it would be interesting to know 
a necessary condition. 
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APPENDIX A: UPPER BOUND FOR THE EFFECTIVE SAMPLE 

SIZE 

I prove here the claim made in Section 2 that the effective sample size 
for the intercept rie = T1*V~^1 is bounded by dfp = L/T, where L is the 
tree length (the sum of all branch lengths), in case all tips are at equal 
distance T from the root. It is easy to see that V is block diagonal, each 
block corresponding to one subtree branching from the root. Therefore, 
is also block diagonal and, by induction, we only need to show that < L/T 
when the root is adjacent to a single edge. Let t be the length of this edge. 
When this edge is pruned from the tree, one obtains a subtree of length L — t 
and whose tips are at distance T — t from the root. Let V_j be the covariance 
matrix associated with this subtree. By induction, one may assume that 
l*V_a < (L - t)/{T - tf. Now V is of the form tJ + V_t, where J = 11* 
is a square matrix of ones. It is easy to check that V~^l = VZjl/(l + 
tl^^^Z]l) so that i*v-ii = ((i*viii)-i + < ((T - t)V(L -t)+ t)-K 
By concavity of the inverse function, ((1 — A)/a + X/b)"^ < (1 ~ '^)^ + 
all A in (0, 1) and all a > & > 0. Combining the two previous inequalities with 
\ = t/T,a = {L-t)/{T-t) and 5 = 1 yields l^V-^l < L/T^ and proves the 
claim. The equality ng = dfp only occurs when the tree is reduced to a single 
tip, in which case ng = 1 = dfp. 

APPENDIX B: ALMOST SURE CONVERGENCE OF a AND t 

Convergence of a in probability is obtained because va^ja^ has a chi- 
square distribution with degree of freedom v = n — r, r being the total 
number of covariates. The exact knowledge of this distribution provides 
bounds on tail probabilities. Strong convergence follows from the conver- 
gence of the series J2n'^{\^n~^'^\ > e) < o<d for all e > 0, which in turn follows 
from the application of Chernov's bound and derivation of large deviations 
[Dembo and Zeitouni (1998)]: F{a^-a^>e) < ex.p{-ul{e)) andF{a^-a^ < 
— e) < exp(— z^/(— e)) where the rate function /(e) = (e — log(l + e))/2 for all 
e > — 1 is obtained from the moment generating function of the chi-square 
distribution. 

The covariance matrix of random effects is estimated with = X* Xi = 
(X — /ix)*V~^(X — fix), with Xi as in the proof of Proposition 5, which 
has a Wishart distribution with degree of freedom i' = n — 1 and variance 
parameter For each vector c then, c^uYlnC has a chi-square distribution 
with variance parameter c*Sc, so that c*l]„c converges almost surely to 
c*Sc by the above argument. Using the indicator of the jth coordinate 
c = lj, then c = Ij -|- Ij, we obtain the strong convergence of S to 5]. 
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APPENDIX C: SYMMETRIC TREES 

With the symmetric samphng from Section 5, eigenvalues of V„ are of 
the form 

\di...di di...dm) 

with multiphcity di . . . di^i{di — 1), the number of nodes at level i if i > 2. 
At the root (level 1) the multiplicity is di. Indeed, it is easy to exhibit the 
eigenvectors of each Aj. Consider Ai for instance. The di descendants of the 
root define di groups of tips. If v is a vector such that Vj = Vk for tips j and k 
is the same group, then it is easy to see that V„v = Aiv. It shows that Ai is 
an eigenvalue with multiplicity di (at least). Now consider an internal node 
at level i. Its descendants form di groups of tips, which we name Gi , . . . , G^. . 
Let V be a vector such that vj = if tip j is not a descendant of the node 
and Vj = ttg if j is a descendant from group g. Then, if oi + • • • + a^. = 0, it 
is easy to see that V„vAjV. Since the multiplicities sum to n, all eigenvalues 
and eigenvectors have been identified. 

APPENDIX D: BIG APPROXIMATION 

Proof of Proposition 7. The prior vr is assumed to be sufficiently 
smooth (four times continuously differentiable) and bounded. The same con- 
ditions are also required for iTm defined by = sup^^ 7r(/3i, (j|/3o) in model 
Mq and TT-m = sup^jjjj^^^p 7r(/?i, cr|/3o, Aop) in model Mi. The extra assumption 
on is pretty mild; it holds when parameters are independent a priori, for 
instance. 

For model Mq the likelihood can be written 
-2 logL(0) = -2 log m + n (^^ - 1 - log ^) 

+ m - /3i)*X*V-iX(/?i - + l*V-4(/3o - Po? 

+ 2(/?o - /3o)l*V-iX(/?i - pi))/a\ 

Rearranging terms, we get — 21ogL(^) = — 21ogL(^) + 2nhn{0) + a„(/3o — 
Po)^/a^, where a„ = l^V-^l - l*V-iX(X*V-iX)"^X*V-il, 

2/ln(^) = — - 1 - log — + (/3l - ^(/3i - m) 

n \a'^ a'^ J 

and ui=i5i- {Po - /3o)(X*V-iX)~^X*V^il. For any fixed value of Pq, 
consider /i„ as a function of Pi and a. Its minimum is attained at ui and 



HIERARCHICAL AUTOCORRELATION IN COMPARATIVE DATA 



23 



= 0"^ + aniPo — $o)'^/n. At this point the minimum value is 2/i„(ni,(5"i) = 

log(l + a„(/3o — /3o)^/ d-niPo — /3o)^/ {na'^) and the second derivative of 
hn is J„ = diag(X*V-iX/(n<Tf),2/(7f). Note that fix = 1^V-^X/{1^V-^1) 
is the row vector of estimated ancestral states of X, so by Theorem 1, it is 
convergent. Note also that X*V~^X = (n — 1)S + {P\'~^l)iJ,x^fix- Since 
1*V~^1 is assumed bounded, X*V~^X = n5] + 0(1) almost surely, and the 
error term depends on X only, not on the parameters /? or u. Consequently, 
Qn = 1*V~^1 + 0{n~^) is almost surely bounded and af = a'^ + 0{n~^). It 
follows that for any fixed Pq, 3n converges almost surely to diag(S/(T^, 2/(T^). 
Therefore, its eigenvalues are almost surely bounded and bounded away 
from zero, and hn is Laplace-regular as defined in Kass, Tierney and Kadane 
(1990). Theorem 1 in Kass, Tierney and Kadane (1990) shows that 

-2 log J e-"^"7rd/?id(T 

= 2nhn{ui,ai) + (fc + 1) logn + logdet Si 

- (A; + 1) log(27ra2) + log 2 - 2 log(7r(/3i , a|/3o) + 0{n-^)) 

with Si = X*V,7-'^X/n = S + 0{n^^). Integrating further over /?o, we get 
-21ogP(y) = -21ogL(i9) + {k + 1) log n + logdet Si - {k + l)log(27ra2) + 
log 2 + Dn, where 

f / n-k-l ^OnifSo-PofW 
Dn = -21ogy exp(^ + na^ )) 

X (7r(/3i,<T|/3o) + 0(n-i))^(/3o)(i/3o. 

Heuristically, we see that converges to t^^ = liml*V~^l and for fixed Pq 
the integrand is equivalent to exp(— (/3o — /3o)^/(2io<7^)), so we would con- 
clude that Dn converges to D = —2 log / exp(— (/?o — /3o)^/(2to<5'^))'/r(/3o,/3i,d-) dPo 
as given in (1) and, thus, 

- 2 log P(y ) = -2 log L(^) (/c 1) log n log det S - (A; -M) log(27r(T2) 
+ log2 + D + o{l). 

Formally, we need to check that the 0{n~^) term in Dn has an o(l) contri- 
bution after integration, and that the limit of the integral is the integral of 
the point-wise limit. The integrand in Dn is the product of 

/n(/?o) =n('=+i)/2| exp(logL(0)-logL(^))7r(/3i,a|/3o)d/3i(ia 

and of a quantity that converges almost surely: (2 det Si)-'^/^(27r(T^)~('^+^)/^. 
Maximizing the likelihood and prior in Pq, we get that is uniformly 
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bounded in Pq by 

n('=+^)/2 / exp (-^ - 1 - log ^ + (/?! - f^i)%{f3i - ) 

where S2 = (XV^-'^X — l^W~'^l[i\fix) /n converges almost surely to S. 
Since vr-m is assumed smooth and bounded, we can apply Theorem 1 from 
Kass, Tierney and Kadane (1990) again, and fniPo) is bounded by 
(2det 5]2)~^/^(27r(T^)('^+^)/^ x TTm0i,d-) which is a convergent quantity. There- 
fore, fn is uniformly bounded and by dominated convergence, the limit of 
/ fnirdPo equals the integral of the point-wise limit so that D„ = D + o(l) 
as claimed in (1). 

For model Mi the proof is similar. The value ui is now f3i — (X*V^-'^X) ((/3o 
^o)X*y-il + (Aop - Aop)X*V-iltop). The term a„(/3o -/3o)^ is replaced by 
/3* A„/3, where /3* = (/3o — Po, Ptop — Aop) and A„ is the 2x2 symmetric ma- 
trix with diagonal elements an and l{;opV~^ltop — ltopV^-'^X(X*V~-'^X) ^X*V 

and off-diagonal element I'V-^ltop - l*V-iX(X*V"iX)"^X*V-^ltop. Note 
that, as before, elements in A„ are dominated by their first term, since 
X*V-^X = nT, + 0(1) almost surely. 

I show below that A.„ converges to W""'^ as defined in (2), whose elements 
are the limits of 1*V~^1, l*QpV~^ltop and l*V~^ltop. The first quantity is 
t^^, finite by assumption. The second quantity equals 1*V~^1, where V is 
obtained by pruning the tree from all tips not in the top subtree, so it con- 
verges and is necessarily smaller than t^^ . The third quantity exists because 
(1*V~^1) (l*V~^ltop) is Atop; the estimated state at the root from char- 
acter Itop- Theorem 1 cannot be applied to show its convergence, because 
Itop is a nonrandom character, but convergence follows from the following 
facts: (a) Atop is the estimated state at the root from a tree where the top 
subtree is reduced to a single "top" leaf whose subtending branch length 
decreases when more tips are added to the top subtree, to a nonnegative 
limit, (b) On the reduced tree, Atop is the weight with which the top leaf 
contributes to ancestral state estimation, (c) This weight decreases as more 
tips are added outside the top subtree. □ 

Acknowledgments. The author is very grateful to Thomas Kurtz for in- 
sightful discussions on the almost sure convergence result. 
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